[Intel NPU] Add Windows & Linux Intel NPU support by Looong01 · Pull Request #1171 · lightvector/KataGo

Looong01 · 2026-03-16T18:49:27Z

Summary

This PR adds and hardens the Windows & Linux Intel NPU path for KataGo using the ONNX backend with ONNX Runtime + OpenVINO Execution Provider, and updates docs/config guidance for an end-to-end workflow.

It also improves failure behavior for non-ONNX builds and simplifies Windows & Linux dependency handling.

What Changed

1) ONNX backend and OpenVINO provider support

Added/updated ONNX Runtime provider selection via onnxProvider (cpu, openvino, cuda, tensorrt, migraphx, coreml).
Added/updated OpenVINO-specific runtime options:
- onnxOpenVINODeviceType
- onnxOpenVINODeviceId
- onnxOpenVINOCacheDir
- onnxOpenVINOEnableNPUFastCompile (best-effort; depends on ORT build support)
Supports both:
- loading raw .onnx models directly
- loading .bin/.bin.gz models via internal conversion to ONNX graph
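As a rough illustration of how these options fit together, the sketch below renders a few of them as `key = value` lines in the style of the config snippets quoted later in this thread. Only `onnxProvider = openvino` and `onnxOpenVINODeviceType = NPU` come from this PR's validation notes; the cache directory value and the helper name `render_onnx_overrides` are assumptions made for the example.

```python
def render_onnx_overrides(options):
    """Render a dict of config options as 'key = value' lines, one per option."""
    return "\n".join(f"{key} = {value}" for key, value in options.items())


# The first two values are taken from the PR text.
# The cache directory is a placeholder (assumption), not a documented default.
npu_options = {
    "onnxProvider": "openvino",
    "onnxOpenVINODeviceType": "NPU",
    "onnxOpenVINOCacheDir": "./openvino_cache",  # assumed path
}

print(render_onnx_overrides(npu_options))
```

The remaining keys (`onnxOpenVINODeviceId`, `onnxOpenVINOEnableNPUFastCompile`) are left out because the PR text does not give example values for them.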

2) `exportonnx` command behavior

exportonnx is available in ONNX builds and exports fixed-size ONNX models.
Default export board size is 19x19 (-x/-y can override).
In non-ONNX builds, exportonnx now returns a clear error instead of failing ambiguously.

3) Config safety for non-ONNX binaries

In non-ONNX builds, forcing onnx* config keys now fails fast with a clear message.
Prevents silent misconfiguration when users accidentally pass ONNX-only config into CUDA/OpenCL/Eigen/etc builds.
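The fail-fast behavior can be pictured with a small sketch. The real check lives in KataGo's C++ config handling, which is not shown in this thread, so the prefix match on `onnx`, the function names, and the error wording below are all assumptions.

```python
def find_onnx_only_keys(config_keys):
    """Return the config keys that start with the 'onnx' prefix, sorted."""
    return sorted(k for k in config_keys if k.startswith("onnx"))


def check_config(config_keys, is_onnx_build):
    """Raise ValueError when a non-ONNX build is given ONNX-only keys."""
    offending = find_onnx_only_keys(config_keys)
    if offending and not is_onnx_build:
        raise ValueError(
            "ONNX-only config keys are not supported by this build: "
            + ", ".join(offending)
        )


# An ONNX build accepts the keys; a non-ONNX build rejects them immediately.
check_config(["numSearchThreads", "onnxProvider"], is_onnx_build=True)
try:
    check_config(["numSearchThreads", "onnxProvider"], is_onnx_build=False)
except ValueError as err:
    print(err)
```

Raising at load time, rather than ignoring unknown keys, is what turns a silent misconfiguration into an immediate and readable failure.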

4) CMake dependency flow

Kept ONNX runtime root wiring via ONNXRUNTIME_ROOT (defaulting to cpp/external/onnxruntime-win-x64-openvino and cpp/external/onnxruntime-linux-x64-openvino).
Added/updated automatic dependency fetch flow for Windows & Linux builds (zlib, onnx, protobuf) through vcpkg when enabled.
ONNX Runtime shared libraries (DLLs on Windows, .so files on Linux) are copied to the output directory during the build.
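The per-platform default for `ONNXRUNTIME_ROOT` amounts to a simple lookup. The sketch below mirrors that idea outside CMake; the two paths are the defaults named above, while the function name and the use of `platform.system()` are assumptions for illustration and not how CMakeLists.txt is written.

```python
import platform

# Default ONNX Runtime locations named in this PR description.
DEFAULT_ROOTS = {
    "Windows": "cpp/external/onnxruntime-win-x64-openvino",
    "Linux": "cpp/external/onnxruntime-linux-x64-openvino",
}


def default_onnxruntime_root(system=None):
    """Return the default ONNXRUNTIME_ROOT for the given (or current) OS name."""
    system = system or platform.system()
    if system not in DEFAULT_ROOTS:
        raise ValueError(f"no default ONNX Runtime root for {system!r}")
    return DEFAULT_ROOTS[system]


print(default_onnxruntime_root("Linux"))
```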

5) Documentation updates

Compiling.md:
- Added explicit Windows & Linux Intel NPU setup steps:
  - Visual Studio Community or VS 2026 Build Tools (Desktop C++)
  - Intel NPU driver install
  - OpenVINO archive install
  - ONNX Runtime build with OpenVINO EP (use_openvino=NPU)
- Added the exact file-copy checklist into cpp/external/onnxruntime-win-x64-openvino.
- Added minimal ONNX backend build command.
README.md:
- Added Intel NPU quick-start section for ONNX/OpenVINO.
- Added minimal commands for:
  - exportonnx (default 19x19)
  - benchmark
  - gtp

Behavior Notes

Multi-device mapping (onnxDeviceToUseThread*) is mainly intended for ONNX providers like CUDA/TensorRT/MIGraphX.
OpenVINO Intel NPU usage is typically single-device.

Validation

ONNX build compiles successfully on Windows & Linux.
exportonnx works from .bin/.bin.gz -> .onnx.
benchmark/gtp run with onnxProvider=openvino and onnxOpenVINODeviceType=NPU.
Non-ONNX binaries now correctly reject ONNX-only config keys.

Looong01 · 2026-03-16T18:50:44Z

This is a screenshot of testing in Sabaki:

The binary release is here: https://github.com/Looong01/KataGo-Multi-backends/releases/tag/v1.16.4-openvino

Looong01 · 2026-03-16T18:58:09Z

I partially referenced the code from #1164, and I am very grateful to @ChinChangYang.

Looong01 · 2026-03-16T23:31:39Z

Add Linux support:

foxrainowo · 2026-03-18T07:25:49Z

This is wonderful work! I will test this backend in a few days.

Looong01 · 2026-03-18T07:47:32Z

I will implement an AMD NPU backend in the next few days.

foxrainowo · 2026-03-20T18:33:32Z

@Looong01

I conducted some tests on my device with no issues, successfully calling the Intel NPU:
Using b28c512nbt, the GPU speed was 18–23 visits/s, and the NPU speed was 55–70 visits/s. That’s 2.7 to 3 times faster.
For multi-network matches, the speed reached 1.8 times the original.

I’ve come to a preliminary conclusion: the NPU backend should not be configured with multi-threading. Its initialization time depends on the number of threads set: the more threads, the longer the wait. In addition, multi-threading actually slows down the computation speed. For single-game analysis, I use a single thread because it is the fastest and offers the best quality (as shown in the figure below, the speed of thread 12 is very slow because of the initialization). For multi-network matches, I set it to “run two games simultaneously” because running too many games at once slows down the speed and reduces performance.

Do you expect this backend to affect accuracy, or are there any comparative tests on this?
During initialization, it generates many blob files. What are these blob files?
What is the function of these parameters? Can they be automated, and is it necessary for users to modify them?
onnxInputSpatial = input_spatial
onnxInputGlobal = input_global
onnxInputMeta = input_meta
onnxOutputPolicy = out_policy
onnxOutputValue = out_value
onnxOutputMiscvalue = out_miscvalue
onnxOutputOwnership = out_ownership
onnxModelVersion = 15

Looong01 · 2026-03-20T18:40:09Z

Thank you for testing.

No. I have run many tests, and this backend does not affect accuracy.
The blob files are the NPU's compilation cache. A model has to be compiled the first time it is run on the NPU, as with any model running on an NPU. The TensorRT backend generates cache files in the same way.
These are low-level engine configuration parameters. Users generally do not need to change them, but they are useful for debugging.
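One way to picture what the `onnxInput*` / `onnxOutput*` parameters asked about here are for: the sketch below assumes they name the input and output tensors of the ONNX graph, which the thread does not state explicitly. The list of model tensors and the helper `unmatched_keys` are hypothetical.

```python
# Tensor names copied from the config lines quoted in this thread.
CONFIGURED_NAMES = {
    "onnxInputSpatial": "input_spatial",
    "onnxInputGlobal": "input_global",
    "onnxInputMeta": "input_meta",
    "onnxOutputPolicy": "out_policy",
    "onnxOutputValue": "out_value",
    "onnxOutputMiscvalue": "out_miscvalue",
    "onnxOutputOwnership": "out_ownership",
}


def unmatched_keys(configured, model_tensor_names):
    """Return the config keys whose tensor name is not present in the model."""
    available = set(model_tensor_names)
    return [key for key, name in configured.items() if name not in available]


# Hypothetical model that uses a different name for its policy output.
custom_model_tensors = [
    "input_spatial", "input_global", "input_meta",
    "policy", "out_value", "out_miscvalue", "out_ownership",
]
print(unmatched_keys(CONFIGURED_NAMES, custom_model_tensors))
```

Under that assumption, overriding a key would only be needed for a model whose tensors are named differently, which fits the answer that most users can leave these values alone.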

foxrainowo · 2026-03-21T00:44:26Z

Thank you!

I am concerned about the poor performance of multi-threading. As shown in the figure, when the number of threads increases, the computation speed actually decreases. Is this because the NPU itself is not suitable for multi-threading, or is it still possible to optimize multi-threading at this stage?

kaorahi · 2026-03-21T13:23:15Z

This is amazing on my Linux notebook. I am seeing a 3.5x speedup (87.30 vs 25.16 visits/s) compared to OpenCL, which seems unusually slow on my system. I really appreciate this. As for katago benchmark, it recommends numSearchThreads = 1 in my case as well.
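For reference, the quoted speedup follows directly from the two measurements; the variable names below are only for illustration.

```python
npu_visits_per_s = 87.30     # ONNX/OpenVINO NPU figure reported above
opencl_visits_per_s = 25.16  # OpenCL figure reported above

speedup = npu_visits_per_s / opencl_visits_per_s
print(f"{speedup:.2f}x")  # prints 3.47x, which rounds to the quoted 3.5x
```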

To build ONNX Runtime, I had to downgrade gcc-15 to gcc-14.

CC=gcc-14 CXX=g++-14 CMAKE_PREFIX_PATH=/usr/lib/cmake/openvino2026.0.0 ./build.sh --config Release --use_openvino NPU --build_shared_lib --skip_tests

Also, the source directories seem different from the document, so I used the following commands in zsh.

cd ~/katago/
mkdir -p cpp/external/onnxruntime-linux-x64-openvino/{include,lib/{cmake/onnxruntime,pkgconfig}}
cd cpp/external/onnxruntime-linux-x64-openvino
cp -r ~/onnxruntime/include/onnxruntime/core include/
cp ~/onnxruntime/include/onnxruntime/**/{cpu_provider_factory.h,provider_options.h,onnxruntime_c_api.h,onnxruntime_cxx_api.h,onnxruntime_cxx_inline.h,onnxruntime_env_config_keys.h,onnxruntime_ep_c_api.h,onnxruntime_ep_device_ep_metadata_keys.h,onnxruntime_float16.h,onnxruntime_lite_custom_op.h,onnxruntime_run_options_config_keys.h,onnxruntime_session_options_config_keys.h} include/
cp ~/onnxruntime/build/Linux/Release/**/{libonnxruntime_providers_openvino.so,libonnxruntime_providers_shared.so,libonnxruntime.so.1.*,libonnxruntime.so.1,libonnxruntime.so} lib/
cp ~/onnxruntime/build/Linux/Release/**/{onnxruntimeConfig.cmake,onnxruntimeConfigVersion.cmake,onnxruntimeTargets.cmake,onnxruntimeTargets-release.cmake} lib/cmake/onnxruntime/
cp ~/onnxruntime/build/Linux/Release/**/libonnxruntime.pc lib/pkgconfig/
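For readers not using zsh, the directory skeleton produced by the `mkdir -p` brace expansion above can be sketched as follows. The copy steps are not reproduced, and the example runs inside a temporary directory rather than a real checkout; the function name `create_layout` is an assumption.

```python
import tempfile
from pathlib import Path

# Subdirectories created by the mkdir -p line above.
LAYOUT = ["include", "lib/cmake/onnxruntime", "lib/pkgconfig"]


def create_layout(root):
    """Create the directory skeleton under root and return the created paths."""
    created = []
    for rel in LAYOUT:
        path = Path(root) / rel
        path.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p
        created.append(path)
    return created


# Demonstrated in a temporary directory so nothing in a real checkout is touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "cpp/external/onnxruntime-linux-x64-openvino"
    for path in create_layout(root):
        print(path.relative_to(tmp))
```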

ChinChangYang · 2026-03-24T12:25:17Z

Claude detected an issue in a Docker container.

Bug: onnxmodelbuilder.cpp fails to compile on Linux/GCC — ONNX_API macro undefined

Error message:

/path/to/_deps/onnx-build/onnx/onnx-ml.pb.h:42:17:
error: variable 'ONNX_API TableStruct_onnx_2fonnx_2dml_2eproto' has initializer but incomplete type
   42 | struct ONNX_API TableStruct_onnx_2fonnx_2dml_2eproto {
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

/path/to/_deps/onnx-build/onnx/onnx-ml.pb.h:43:3:
error: expected primary-expression before 'static'
   43 |   static const uint32_t offsets[];

/path/to/_deps/onnx-build/onnx/onnx-ml.pb.h:48:51:
error: expected initializer before '_AttributeProto_default_instance_'
   48 | ONNX_API extern AttributeProtoDefaultTypeInternal _AttributeProto_default_instance_;

Root cause:

onnxmodelbuilder.cpp includes <onnx/onnx-ml.pb.h> directly, bypassing onnx/onnx_pb.h which defines the ONNX_API macro. When ONNX_API is undefined, the compiler treats it as an identifier rather than an attribute specifier, breaking the struct/extern declarations in the generated protobuf header.

Reproduction steps:

# 1. Clone and checkout this PR branch
git clone https://github.com/lightvector/KataGo.git
cd KataGo
git fetch origin pull/1171/head:pr-1171
git checkout pr-1171

# 2. Download ORT prebuilt + build onnx_proto and protobuf-lite from source
#    (ONNXRUNTIME_ROOT = prebuilt ORT package dir)
#    (ONNX_INCLUDE_DIR = ort-build/_deps/onnx-build)
#    (ONNX_PROTO_LIB   = ort-build/_deps/onnx-build/libonnx_proto.a)
#    (PROTOBUF_INCLUDE_DIR = ort-build/_deps/protobuf-src/src)
#    (PROTOBUF_LIB     = ort-build/_deps/protobuf-build/libprotobuf-lite.a)

# 3. Configure
mkdir build && cd build
cmake ../cpp \
  -DUSE_BACKEND=ONNX \
  -DKATAGO_AUTO_FETCH_DEPS=OFF \
  -DONNXRUNTIME_ROOT=<ort-prebuilt-dir> \
  -DONNX_INCLUDE_DIR=<ort-build>/_deps/onnx-build \
  -DONNX_PROTO_LIB=<ort-build>/_deps/onnx-build/libonnx_proto.a \
  -DPROTOBUF_INCLUDE_DIR=<ort-build>/_deps/protobuf-src/src \
  -DPROTOBUF_LIB=<ort-build>/_deps/protobuf-build/libprotobuf-lite.a \
  -DCMAKE_CXX_FLAGS="-DONNX_ML"

# 4. Build → fails at onnxmodelbuilder.cpp
cmake --build . -j$(nproc)

System environment:

Item	Value
OS	Linux aarch64
Compiler	GCC 15.2.0
ONNX Runtime	v1.21.0
protobuf	3.21.12 (ORT bundled)
cmake	4.2.3

Fix:

In cpp/neuralnet/onnxmodelbuilder.cpp, change line 9:

-#include <onnx/onnx-ml.pb.h>
+#include <onnx/onnx_pb.h>

onnx_pb.h defines ONNX_API before including onnx-ml.pb.h, resolving the macro issue. Note: when using the new ONNX_INCLUDE_DIR cmake variable, onnx_pb.h must also be present in that directory (it lives in the ONNX source tree, not the build output). See also: ChinChangYang/KataGo#18.
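A small script along these lines could apply the same one-line change to a source string and report how many includes were rewritten. It is only a sketch: the sample source text and the helper name are made up for the example, and the first include in the sample is hypothetical.

```python
import re

# Matches a direct include of the generated protobuf header, keeping indentation.
DIRECT_INCLUDE = re.compile(
    r"^([ \t]*#[ \t]*include[ \t]*)<onnx/onnx-ml\.pb\.h>", re.MULTILINE
)


def rewrite_direct_includes(source):
    """Replace direct includes of onnx-ml.pb.h with onnx_pb.h.

    Returns a (new_source, replacement_count) tuple.
    """
    return DIRECT_INCLUDE.subn(r"\1<onnx/onnx_pb.h>", source)


# Hypothetical file contents; only the second line reflects the reported include.
sample = '#include "example.h"\n#include <onnx/onnx-ml.pb.h>\n'
fixed, count = rewrite_direct_includes(sample)
print(count)
print(fixed)
```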

Looong01 · 2026-03-25T13:22:14Z

Actually, my CMakeLists.txt handles this correctly. I use vcpkg to manage these dependencies.

"https://github.com/Looong01/KataGo-Multi-backends/blob/115e6daba5f8063fd70d7c89631f123cccced902/cpp/CMakeLists.txt".

Or do you still think I need to make this change?

Looong01 · 2026-03-25T13:25:00Z

Because the NPU is a different architecture (totally different from a GPU or CPU), a single thread is enough for it.

ChinChangYang · 2026-03-25T13:32:50Z

I think you misunderstood my comment. The reproduction steps fetch #1171, which is exactly this PR, not mine.

Looong01 · 2026-03-25T15:47:48Z

But I don't get any errors when I compile it. Maybe it only happens with GCC 15?

ChinChangYang · 2026-03-26T01:40:02Z

11433e6 resolves the issue. Thanks.

kaorahi · 2026-04-19T05:03:16Z

This is amazing on my Linux notebook. I am seeing a 3.5x speedup (87.30 vs 25.16 visits/s) compared to OpenCL, which seems unusually slow on my system.

This has been working perfectly for the past month. It would be great to have this feature merged into the official KataGo. Without it, I would have almost had to give up on KataGo after moving to my new PC. Thank you again, @Looong01.

On the Intel Core Ultra 7 255U, OpenCL KataGo is sadly slow, running at less than half the speed of a 5-year-old system with a Core i7-1165G7.

Looong01 · 2026-04-19T05:12:07Z

@lightvector

lightvector · 2026-04-20T15:30:55Z

Thanks, I'll also look at this soon.

Looong01 · 2026-04-20T18:56:52Z

Thanks!

kaorahi · 2026-05-29T10:51:41Z

Thank you for the updates. b37aa25 works fine with the following minor corrections to the ONNX Runtime Backend (Linux) section of Compiling.md.

onnxruntime-win-x64-openvino ==> onnxruntime-linux-x64-openvino
build\Linux\Release ==> build/Linux/Release

In my environment, I also needed to downgrade GCC when running ./build.sh:

CC=gcc-14 CXX=g++-14 ./build.sh ...

At the moment, this is the only branch that runs fast enough for practical use in my environment. I would appreciate official support for this.

foxrainowo · 2026-06-17T09:38:28Z

@Looong01 I don't know if this is a problem with the original or with OpenVINO.

Looong01 · 2026-06-17T18:19:17Z

This is a DEVICE_LOST from the Intel NPU that occurred mid-inference, after roughly 6 hours of self-play (~22,950 games).
The core error:
L0 zeCommandQueueExecuteCommandLists result: ZE_RESULT_ERROR_DEVICE_LOST,
code 0x70000001 – device hung, reset, was removed, or driver update occurred
L0 refers to Level Zero — the OpenVINO intel_npu plugin talks to the NPU through the Level Zero API. The call chain is:
ONNX Runtime → OpenVINO EP (ov_interface.cc:28) → intel_npu plugin (infer_request.cpp:224) → Level Zero (zero_wrappers.cpp) → device lost.
Both errors (subgraph_4 and subgraph_3) have nearly identical timestamps (15:06:27.2041348 and .2041655, ~30 µs apart), which indicates this is not a problem with any individual subgraph — the entire NPU device dropped at that instant, so every subgraph running at the time failed simultaneously.
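The gap between the two timestamps can be checked with a little arithmetic. Both fall within the same second, so only the seven fractional digits (units of 100 ns) need comparing; the helper name below is an assumption.

```python
def fraction_to_ticks(fraction):
    """Convert a fractional-second digit string into 100 ns ticks (7 digits)."""
    return int(fraction.ljust(7, "0"))


first = fraction_to_ticks("2041348")   # 15:06:27.2041348
second = fraction_to_ticks("2041655")  # 15:06:27.2041655

gap_us = (second - first) / 10  # 10 ticks of 100 ns make one microsecond
print(gap_us)  # 30.7
```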
Why DEVICE_LOST is triggered

0x70000001 is a fairly generic device-level error. The likely causes, ordered by probability for this scenario:

1. Driver updated mid-run (most likely)

The error message itself says "or driver update occurred." Windows Update silently pushes Intel NPU driver updates in the background. If Windows Update updated the NPU driver during the 6-hour run, the device gets re-enumerated and all existing Level Zero contexts/handles are invalidated, causing in-flight inference to fail with device lost.

2. Long-running resource leak / handle accumulation

This is ~22,950 games with many inferences each, a very large volume. If the intel_npu plugin or this OpenVINO version leaks memory or handles when repeatedly creating/destroying infer requests, accumulation past some threshold can hang the NPU firmware, triggering a GPU-TDR-style reset. The fact that it crashed after 6 hours rather than immediately is consistent with an accumulation-type issue.

3. NPU firmware/driver hang (TDR)

A single inference stalls past the watchdog timeout, the NPU is force-reset, and all subsequent command-queue submissions fail.

4. Thermal/power-induced reset

Possible under sustained load, but NPU power draw is low, so this is the least likely.

Suggested investigation

First, rule out the simplest cause, a driver update:

Get-WinEvent -LogName System | Where-Object {
$_.Message -match "NPU|Intel.*AI Boost|driver"
} | Select-Object TimeCreated, Id, Message -First 20

Focus on whether there were any driver-install / device re-enumeration events around 15:06. You can also check Get-WindowsUpdateLog or the Update history in Settings.

If a driver update is ruled out, other directions:

- Add auto-restart + error recovery to the self-play loop. This is the most practical fix: after a device lost, the current process generally has to rebuild the ONNX Runtime session (re-initialize the Level Zero context); simply catching the exception and continuing will likely fail on all subsequent inferences. The most robust approach is to have an outer script detect this error code, kill the process, and relaunch it, resuming from the last SGF/checkpoint.
- Disable automatic driver updates for the NPU to prevent long-running jobs from being interrupted (disable auto-update for the device in Device Manager, or pause Windows Update).
- Upgrade OpenVINO / the NPU driver to the latest stable version and re-run, to check whether the leak has been fixed. Reproducing with a short high-frequency stress test is more efficient than blindly running for 6 hours.
- If you suspect a leak, monitor NPU memory usage during the run to see whether it grows monotonically.
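The auto-restart idea might look roughly like the outer script below. It is a minimal sketch under several assumptions: the process is assumed to exit on its own after the error (no kill logic is included), its output is buffered until exit, and the command shown is a stand-in rather than a real self-play invocation. Resuming from the last SGF/checkpoint is left to the relaunched process.

```python
import subprocess
import sys

DEVICE_LOST_MARKER = "ZE_RESULT_ERROR_DEVICE_LOST"  # string from the error above


def run_once(command):
    """Run the command to completion and return (exit_code, combined_output)."""
    result = subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    return result.returncode, result.stdout


def supervise(command, max_restarts=3, runner=run_once):
    """Relaunch the command while its output reports a lost device.

    Returns the number of restarts that were performed.
    """
    restarts = 0
    while True:
        _exit_code, output = runner(command)
        if DEVICE_LOST_MARKER not in output or restarts >= max_restarts:
            return restarts
        restarts += 1  # a fresh process rebuilds the ONNX Runtime session


# Stand-in command: a child Python process that exits cleanly without the marker.
# A real setup would pass the self-play command line here instead.
demo = [sys.executable, "-c", "print('ok')"]
print(supervise(demo))
```

Capping the number of restarts keeps a persistently failing device from turning into an endless relaunch loop.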

Looong01 and others added 26 commits July 28, 2025 20:13

Add ROCm backend

1f2ae46

Fix bugs

b455530

Update

8b30cb9

Fix bugs

570ced0

Fix bugs

abb6124

All bug fixed

bfb292e

Update

4606424

test new method

1e8ea78

Update

c1a09cf

Test finished

0957b88

Update docks

c70d841

Update gitignore

1d05ca8

Update new method

9d4662b

Optimize performance

d40bd50

Update new Convlayer method

158d24d

Merge branch 'master' of https://github.com/Looong01/KataGo-ROCm

ec32eb1

Add new compile target

0bfe0a1

Merge branch 'lightvector:master' into master

f5fbb33

Add ROCm for Windows support

26d8c5b

Merge branch 'lightvector:master' into master

555d2f1

Merge branch 'lightvector:master' into master

dbc7cfa

Fix bugs

ed396b7

Merge branch 'lightvector:master' into master

ccec62c

Add Intel NPU support

ce2c9fc

Resume gitignore

d828f21

Edit README.md and Compiling.md

358dd84

Looong01 added 2 commits March 17, 2026 07:29

Add Linux Intel NPU support

496bb96

Edit Compiling.md

115e6da

Looong01 changed the title ~~[Intel NPU] Add Windows Intel NPU support~~ [Intel NPU] Add Windows & Linux Intel NPU support Mar 16, 2026

Fix a bug

11433e6

Looong01 and others added 2 commits April 19, 2026 06:31

Merge branch 'lightvector:master' into Intel_NPU

f467227

Remove AMD GPU support

ffc3fef

Update openvino to 2026.1 and onnxruntime to 1.24.4

6a6a8b1

Looong01 and others added 4 commits May 12, 2026 01:14

Fix a bug

adcbf04

Merge branch 'master' into Intel_NPU

bd851d0

Update ONNX backend for Intel NPU

70dcf8d

Update ONNX backend

b37aa25
